Day 2: NLP Fundamentals and Terminology
To understand LLMs, you first need to know NLP (Natural Language Processing) terminology. Today we’ll organize the NLP terms that appear most frequently in LLM papers and documentation.
NLP Core Terminology
| Term | Description | Example |
|---|---|---|
| Token | The smallest processing unit of text | "Hello world" -> ["Hello", " world"] |
| Corpus | A collection of text data used for training | All of Wikipedia, news article collections |
| Vocabulary | The set of all tokens the model knows | GPT-4’s vocabulary size: ~100,000 tokens |
| Embedding | Words converted into numeric vectors | "king" -> [0.2, -0.5, 0.8, …] |
| Sequence | An ordered arrangement of tokens | A single sentence or paragraph |
| Attention | A mechanism that focuses on important parts of the input | In “He ate the apple,” determining who “he” refers to |
| Encoding | Converting input into internal representation | Sentence -> vector |
| Decoding | Converting internal representation into output | Vector -> sentence |
| Perplexity | A metric of model prediction uncertainty (lower is better) | PPL=15: on average, the model is as uncertain as choosing among ~15 equally likely next words |
| Context Window | The number of tokens a model can process at once | Modern models support tens to hundreds of thousands of tokens |
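Three of the terms above (Vocabulary, Encoding, Decoding) fit together neatly, which a toy sketch can show. The four-entry vocabulary below is entirely made up; real models map tokens to integer IDs in the same spirit, just at a much larger scale:

```python
# Toy vocabulary: maps each known token to an integer ID
vocab = {"Hello": 0, "world": 1, "is": 2, "big": 3}
inverse_vocab = {i: tok for tok, i in vocab.items()}

def encode(tokens):
    """Encoding: tokens -> internal representation (here, just IDs)."""
    return [vocab[t] for t in tokens]

def decode(ids):
    """Decoding: internal representation -> tokens."""
    return [inverse_vocab[i] for i in ids]

ids = encode(["Hello", "world"])
print(ids)          # [0, 1]
print(decode(ids))  # ['Hello', 'world']
```

In a real LLM the "internal representation" is a vector per token rather than a bare ID, but the round trip (encode, process, decode) is the same.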
Basic Tokenization Concept
```python
# Simplest tokenization: splitting by whitespace
sentence = "Natural language processing is really fascinating"
tokens_simple = sentence.split()
print(tokens_simple)
# ['Natural', 'language', 'processing', 'is', 'really', 'fascinating']

# Real LLMs use subword tokenization:
# "fascinating" -> ["fasc", "inating"], i.e. split into smaller pieces
```
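How does a subword split like that come about? Real tokenizers learn their subword vocabulary from data (with algorithms such as BPE), but the lookup step can be sketched with a simplified greedy longest-match over a hand-picked, hypothetical subword vocabulary:

```python
# Hypothetical subword vocabulary (real tokenizers learn this from a corpus)
subwords = {"fasc", "inating"}

def subword_tokenize(word, vocab):
    """Greedy longest-match: repeatedly take the longest known piece."""
    tokens, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):  # try longest piece first
            if word[i:j] in vocab:
                tokens.append(word[i:j])
                i = j
                break
        else:
            tokens.append(word[i])  # unknown character: fall back to itself
            i += 1
    return tokens

print(subword_tokenize("fascinating", subwords))
# ['fasc', 'inating']
```

The key property this preserves from real tokenizers: any word can be tokenized, because rare or unseen words fall back to smaller pieces instead of failing.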
Intuitive Understanding of Embeddings
```python
import numpy as np

# Embeddings: representing words as numeric vectors.
# Words with similar meanings lie close together in vector space.
embeddings = {
    "king": np.array([0.8, 0.2, -0.5, 0.9]),
    "queen": np.array([0.7, 0.3, -0.4, 0.85]),
    "apple": np.array([-0.2, 0.9, 0.6, -0.1]),
}

# Measure similarity between words using cosine similarity
def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(f"king-queen similarity: {cosine_similarity(embeddings['king'], embeddings['queen']):.3f}")
print(f"king-apple similarity: {cosine_similarity(embeddings['king'], embeddings['apple']):.3f}")
# king-queen: high similarity / king-apple: low similarity
```
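Because embeddings are ordinary vectors, you can also do arithmetic on them; the famous "king - man + woman ≈ queen" analogy is the classic illustration. The vectors for "man" and "woman" below are made up (and chosen so the analogy works exactly, which real embeddings only approximate):

```python
import numpy as np

# Toy vectors; the numbers are invented so the analogy lands exactly
embeddings = {
    "king":  np.array([0.8, 0.2, -0.5, 0.9]),
    "queen": np.array([0.7, 0.3, -0.4, 0.85]),
    "man":   np.array([0.75, 0.15, -0.45, 0.2]),
    "woman": np.array([0.65, 0.25, -0.35, 0.15]),
}

# The classic analogy: king - man + woman ≈ queen
result = embeddings["king"] - embeddings["man"] + embeddings["woman"]
print(result)
print(np.allclose(result, embeddings["queen"]))  # True
```

With real embeddings (Word2Vec, or a modern model's embedding layer) you would look up the nearest neighbor of `result` by cosine similarity and find "queen" near the top, rather than an exact match.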
Perplexity Calculation Example
```python
import numpy as np

# Perplexity: how well a model predicts the next word.
# PPL = exp(average cross-entropy loss)
def calculate_perplexity(loss):
    return np.exp(loss)

good_model_loss = 2.7  # well-trained model
bad_model_loss = 5.5   # poorly trained model

print(f"Good model PPL: {calculate_perplexity(good_model_loss):.1f}")
print(f"Bad model PPL: {calculate_perplexity(bad_model_loss):.1f}")
# Lower PPL means better next-word prediction
```
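Where does that loss number come from? It is the average negative log of the probabilities the model assigned to the tokens that actually occurred. The sketch below uses made-up probabilities, plus a sanity check that connects perplexity back to the "number of candidates" intuition from the table:

```python
import numpy as np

# Probabilities the model assigned to each actual next token (made up)
token_probs = np.array([0.25, 0.10, 0.40, 0.05])

loss = -np.log(token_probs).mean()  # average cross-entropy
ppl = np.exp(loss)
print(f"loss={loss:.3f}, PPL={ppl:.1f}")

# Sanity check: a model that spreads probability uniformly over N tokens
# has perplexity exactly N, hence "PPL ~ number of candidate next words"
uniform = np.full(8, 1 / 8)
print(np.exp(-np.log(uniform).mean()))  # 8.0 (up to float rounding)
```

This is also why PPL=1 (see today's exercises) would mean assigning probability 1.0 to every correct next token.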
NLP terminology cannot be memorized in a single day. Use this table as a reference as we dive deeper into each concept in the days ahead.
Today’s Exercises
- Tokenize the sentence “Artificial intelligence is changing the world” by whitespace, by syllable, and by meaning. Explain the differences.
- Summarize the pros and cons of larger embedding vector dimensions. Compare Word2Vec (300 dimensions) with modern large model embeddings (higher dimensions).
- Think about what it means if Perplexity equals 1, and whether a model with PPL=1 is achievable in practice.